Distributed out-of-memory NMF on CPU/GPU architectures

Abstract

We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density $10^{-6}$.
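The batching/tiling idea in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' implementation: it assumes the classic Frobenius-norm multiplicative updates, uses NumPy in place of GPU kernels, and the function name `nmf_batched` and its parameters are hypothetical. The point it shows is that the input matrix only needs to be touched one row batch at a time, because the products involving it can be accumulated or applied per batch.

```python
import numpy as np


def nmf_batched(A, k, batch_rows, n_iter=200, eps=1e-9, seed=0):
    """Approximate A (m x n, non-negative) as W @ H with rank k.

    Assumption: multiplicative updates for the Frobenius objective.
    A is read only in row batches of size `batch_rows`; each batch read
    stands in for a host-to-device copy in a real out-of-memory setting.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))

    for _ in range(n_iter):
        # H update: accumulate W^T A and W^T W over row batches.
        WtA = np.zeros((k, n))
        WtW = np.zeros((k, k))
        for start in range(0, m, batch_rows):
            rows = slice(start, start + batch_rows)
            A_b = A[rows]
            W_b = W[rows]
            WtA += W_b.T @ A_b
            WtW += W_b.T @ W_b
        H *= WtA / (WtW @ H + eps)

        # W update: each row batch of W depends only on the same rows of A.
        HHt = H @ H.T
        for start in range(0, m, batch_rows):
            rows = slice(start, start + batch_rows)
            A_b = A[rows]
            W[rows] *= (A_b @ H.T) / (W[rows] @ HHt + eps)

    return W, H


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.random((40, 3)) @ rng.random((3, 30))  # exact rank-3 test matrix
    W, H = nmf_batched(A, k=3, batch_rows=7)
    print("relative error:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```

In this sketch the peak working set per step is one batch of A plus the small k-sized factors, which is the memory reduction the abstract attributes to batching. Overlapping the batch copies with compute (CUDA streams) and combining partial sums across devices (NCCL) are not modeled here.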

Similar resources

Compiling for Distributed Memory Architectures

Performance Modeling of Distributed Memory Architectures

We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multi-dimensional arrays, and emulation of butterfly networks. We also show ...

Performance Modeling of Multithreaded Distributed Memory Architectures

In multithreaded distributed memory architectures, long-latency memory operations and synchronization delays are tolerated by suspending the current thread and switching to another thread, which is executed concurrently with the long-latency operation of the suspended thread. Timed Petri nets are used to model several multithreaded architectures at the instruction and thread levels. Model eval...

Parallel rendering of volumetric data set on distributed-memory architectures

A solution is proposed to the problem of interactive visualization and rendering of volume data. Designed for parallel distributed memory MIMD architectures, the volume rendering system is based on the ray tracing (RT) visualization technique, the Sticks representation scheme (a data structure exploiting data coherence for the compression of classified datasets), the use of a slice-partitioning...

Computation of Dendrites on Parallel Distributed Memory Architectures

A code for simulating the solidification of a pure material from its undercooled melt based on a phase-field approach has been written for parallel distributed memory architectures using MPI. The numerical scheme is based on finite differences and results in large sparse non-linear systems which are solved by a backtracking line search modification of Newton's method combined with GMRES. Experiments c...

Journal

Journal title: The Journal of Supercomputing

Year: 2023

ISSN: 0920-8542, 1573-0484

DOI: https://doi.org/10.1007/s11227-023-05587-4